feat(rocpd): add AI-powered GPU trace analysis module (rocpd analyze) #4030
analyze.py:
- Bug: execute() passed CLI key 'format' to analyze_performance() which
expects 'output_format', so --format json/markdown was silently ignored
and text was always written. Fix by mapping the key before the call.
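A minimal sketch of the key-mapping fix described above. The function bodies are hypothetical stand-ins (the real `execute()` and `analyze_performance()` are not shown in this log); only the `format` → `output_format` rename is from the commit message:

```python
# Hypothetical sketch: the CLI namespace key is "format", but the analysis
# entry point expects "output_format"; without the rename the keyword is
# ignored and the default ("text") is always used.
def analyze_performance(output_format="text", **_ignored):
    return "writing report as " + output_format

def execute(cli_args):
    kwargs = dict(cli_args)
    # Map the CLI key to the keyword the analysis function expects.
    kwargs["output_format"] = kwargs.pop("format", "text")
    return analyze_performance(**kwargs)

print(execute({"format": "json"}))  # → writing report as json
```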
cmake/Modules/rocprofiler-sdk-utilities.cmake:
- rocprofiler_sdk_pc_sampling_disabled and
rocprofiler_sdk_pc_sampling_stochastic_disabled called list(GET ...)
on the result of rocprofiler_sdk_get_gfx_architectures without
guarding against an empty list. On build machines without GPUs
(CI containers, cross-compile hosts) CMake configure failed with
"list GET given empty list". Add length check and early-return with
PC sampling disabled when no GPUs are present.
tests/CMakeLists.txt:
- rocprofiler-sdk-tests-gfx-info was left empty on no-GPU hosts,
causing all sub-CMakeLists that do list(GET rocprofiler-sdk-tests-gfx-info 0 ...)
to fail at configure time. Populate the variable with placeholder
"gfx000" when no hardware is detected; this matches none of the
known GPU patterns so all hardware-dependent tests are correctly
disabled while configure completes without errors.
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Add _format_as_webview() function: self-contained HTML report with AMD dark theme, interactive sortable tables, SVG donut gauges for GPU util and wave occupancy, collapsible recommendation cards with priority color-coding, stacked execution breakdown bar, and copy-to-clipboard profiling commands. No external CDN dependencies.
- Wire 'webview' format into format_analysis_output() dispatch
- Add 'webview' to --format CLI choices (text/json/markdown/webview)
- Fix output file extension: execute() now appends .txt/.json/.md/.html automatically based on the selected format, so output files always have the correct extension
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- README.md: add webview to feature list, CLI examples, data flow diagram, AnalysisResult method list, and a new Example 4 section
- AI_ANALYSIS_API.md: add webview to feature list and AnalysisResult methods; document each format's output file extension (.txt/.json/.md/.html); add full Webview section under Output Formats covering features, CLI usage, and Python API usage
- SCHEMA_CHANGELOG.md: add v0.1.1 entry noting webview format addition and auto-extension behavior (no JSON schema changes)
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Add a pure CSS+JS floating tooltip system to the webview HTML report so every visual element explains itself on hover. No external deps.

Tooltips added to:
- Gauge widgets (GPU Utilization, Wave Occupancy): explain the underlying hardware counter formula (GRBM_GUI_ACTIVE/GRBM_COUNT, SQ_WAVES), target thresholds, and current status
- Execution breakdown: stacked bar segments and individual bars for Kernel Execution, Memory Copies, API Overhead, and GPU Idle — each explains what the metric means, good/bad thresholds, and how to fix
- Overview stat cards: Primary Bottleneck (per-type explanation of what it means and how to address it), Total Runtime, Kernel Time, Analysis Tier (explains Tier 1 vs Tier 2 and how to upgrade)
- Hotspot table column headers: Calls, Total/Avg/Min Time, % Total
- Memory transfer table: direction cells (H2D, D2H, D2D, P2P with PCIe/HBM bandwidth context) and all column headers
- Hardware counter table rows (via COUNTER_TIPS JS lookup): GRBM_COUNT, GRBM_GUI_ACTIVE, SQ_WAVES, SQ_WAVE_CYCLES, SQ_INSTS_VALU/SALU/VMEM_RD/VMEM_WR/LDS/SMEM, FETCH_SIZE, WRITE_SIZE, TCP/TCC cache counters, TA_TA_BUSY, and more. Unknown counters get a generic fallback message.

Implementation details:
- #tt floating div follows mouse cursor, repositions at viewport edges
- [data-tip] elements use single-quoted HTML attributes; tip content can include <strong>, <em>, <code>, .tok/.twarn colored spans
- Counter tips use data-ctr attribute + JS COUNTER_TIPS object lookup to decouple tip content from Python string generation
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- AI_ANALYSIS_API.md: expand Webview features list with full tooltip coverage details — gauges (counter formula, thresholds), breakdown bars, overview stats (per-bottleneck guidance), hotspot columns, memory direction cells, and 20+ AMD GPU hardware counter definitions
- README.md: add tooltip note to Example 4 (Interactive HTML Webview) explaining that every visual element is self-documenting on hover
- SCHEMA_CHANGELOG.md: add v0.1.2 entry — no schema changes; notes the COUNTER_TIPS JS lookup, tooltip coverage, and fallback behavior for unknown counters
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Overhaul the --format webview HTML report inspired by AMD dashboard design patterns for a cleaner, more scannable interface:
- Light/Dark theme toggle with localStorage persistence (defaults dark)
- Sticky header with AMD gradient, status summary badges (Critical/Warning/Low/Info counts from recommendations), and metric pills row (runtime, kernel count, analysis tier, timestamp, DB path)
- Status-colored KPI cards in overview: kernel %, bottleneck type, total runtime, and tier each have a colored top border (ok/warn/crit) reflecting health status at a glance
- Section card pattern (.scard) with icon+title+badge headers throughout
- Priority icons on recommendation cards: 🔴 HIGH 🟠 MEDIUM 🟡 LOW ℹ INFO
- Gradient execution breakdown bars and grid-aligned legend rows
- FAB scroll-to-top button (appears after 250px scroll)
- Staggered @keyframes fadeInUp entrance animations on section cards
- Improved typography (system font stack; works fully offline)
- Gauge cards: background fill + hover border effect (Tier 2)
- Improved table headers: uppercase + 2px bottom border

Also updates SCHEMA_CHANGELOG.md (v0.1.3), README.md, and AI_ANALYSIS_API.md to document all new webview UI features. No changes to JSON output schema or analysis logic.
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
CSS `content` property does not process HTML entities. Replace `content:'&rarr;'` with `content:'→'` (U+2192) in the .findings li::before rule so the right-arrow bullet renders correctly instead of displaying as the literal text '&rarr;'. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Documents the root cause and fix for the key findings bullet icons rendering as literal HTML entity text in the webview report. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The #tt floating tooltip used color:var(--text) which in light mode resolves to ~#181828 (near-black) — invisible against the always-dark #0e0e1c tooltip background. Replace with a fixed light color (#dde0f2) so the tooltip remains readable regardless of the active theme. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Documents the root cause and fix for tooltip text being invisible in light theme (color:var(--text) resolving to near-black against an always-dark tooltip background). Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The recommendation engine was suggesting rocprofv3 flags (e.g. --hip-api-trace, --hsa-trace) that were already covered by the user's original --sys-trace run, creating confusing advice.

Fix: inspect the database before generating recommendations to infer which collection flags were already used:
- kernels rows → --kernel-trace covered
- regions rows → --hip-trace / --hsa-trace covered (API spans)
- memory_copies rows → --memory-copy-trace covered
- kernels + regions → full --sys-trace implied (subsumes all trace flags)

Redundant flags are stripped from recommended rocprofv3 commands. Commands whose stripped flags leave nothing new to collect are dropped entirely. rocprof-sys and rocprof-compute commands are always preserved (different tool, always a new perspective).

New helpers: _detect_already_collected(), _filter_rec_commands(), _SYS_TRACE_IMPLIED constant. generate_recommendations() gains an already_collected parameter; analyze_performance() calls the detector and threads the result through.
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
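The detection-and-strip logic above can be sketched roughly as follows. The function names come from the commit message, but the signatures and table-to-flag mapping here are illustrative assumptions, not the real implementation:

```python
# Illustrative sketch: infer which rocprofv3 collection flags are already
# covered by the rows present in the rocpd database.
def detect_already_collected(has_kernels, has_regions, has_memory_copies):
    covered = set()
    if has_kernels:
        covered.add("--kernel-trace")
    if has_regions:
        covered.update({"--hip-trace", "--hsa-trace"})
    if has_memory_copies:
        covered.add("--memory-copy-trace")
    if has_kernels and has_regions:
        covered.add("--sys-trace")  # full sys-trace subsumes the trace flags
    return covered

def strip_redundant(cmd_flags, covered):
    # Empty result means the recommended command collects nothing new
    # and should be dropped entirely.
    return [f for f in cmd_flags if f not in covered]

covered = detect_already_collected(True, True, True)
print(strip_redundant(["--hip-trace", "--pmc"], covered))  # → ['--pmc']
```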
…esent

rocprof-sys --trace collects the same HIP/HSA API call data as rocprofv3 --sys-trace (just in Perfetto format instead of rocpd). Treat it as equivalent and drop it when sys-trace data is already in the database.

Rules in _filter_rec_commands() are now per-tool:
- rocprofv3: strip covered flags; drop if nothing meaningful remains
- rocprof-sys: drop if only --trace (≡ sys-trace); keep when it carries extra flags like --trace-gpu-memory that rocprofv3 can't
- rocprof-compute: always keep (deep hardware counter analysis)
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The LLM was recommending flags already covered by the user's original --sys-trace run (e.g. --hip-api-trace, --hsa-trace, rocprof-sys --trace). Add a new "Context-Aware Profiling Recommendations" section to the LLM reference guide (the "fence") that explicitly instructs the model to:
1. Read profiling_info.profiling_mode to identify what was already collected
2. Know that --sys-trace subsumes --hip-trace, --hsa-trace, --hip-api-trace, --kernel-trace, --memory-copy-trace, --marker-trace, --roctx-trace
3. Know that rocprof-sys --trace is equivalent to --sys-trace (same API data, different format) and must not be recommended when sys-trace exists
4. Only recommend the INCREMENTAL next step (--pmc, rocprof-compute, etc.)
5. State "no additional run needed" when all required data is present

Also add an explicit prohibition in the "What NOT to Do" section.
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Instead of enumerating every flag equivalence (--sys-trace subsumes --hip-trace, --hsa-trace, etc.), instruct the LLM to reason from the tool documentation already present in the guide to determine flag overlap and tool equivalence itself. The "Context-Aware Profiling Recommendations" section is now concise: tell the model what to do (read profiling_mode, use the docs to reason about equivalence, recommend only the incremental next step) without hardcoding every combination that should be in the model's reasoning. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Suppresses .claude/, __pycache__/, *.pyc, and rocpd-output-data/ from appearing as untracked files in git status. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Fixes all 13 issues from the deep-research-report audit:
Critical:
- AIA-001: fix analyze_database() — call individual analysis functions
(compute_time_breakdown, identify_hotspots, analyze_memory_copies,
analyze_hardware_counters, generate_recommendations) instead of the
broken analyze_performance() wrapper that returns str not dict
High:
- AIA-002: fix _build_analysis_result() key mapping (issue/suggestion/
estimated_impact/actions, uppercase priority comparison)
- AIA-003: add WEBVIEW to OutputFormat enum
- AIA-004: fix to_json() to return schema-conformant output via
format_analysis_output(); add to_webview() method; store raw payloads
as result._raw for schema-conformant serialization
- AIA-012: create ai_analysis/tests/test_api_standalone.py (23 tests)
and tests/rocprofv3/rocpd/test_ai_analysis_standalone.py; update docs
Medium:
- AIA-005: re-raise LLMAuthenticationError/LLMRateLimitError instead of
silently downgrading to warnings
- AIA-006: fix _convert_result_to_llm_format() to use real hotspot/
memory/counter data from result._raw instead of empty placeholders
- AIA-007: implement file path redaction in _sanitize_data() using regex
- AIA-008: ReferenceGuideNotFoundError now lists all attempted paths;
get_reference_guide_path() collects all paths before raising
- AIA-009: add DEFAULT_ANTHROPIC_MODEL/DEFAULT_OPENAI_MODEL constants;
model names configurable via ROCPD_LLM_MODEL env var and new
--llm-model CLI flag
- AIA-013: fix validate_database() to query type IN ('table','view')
Low:
- AIA-010: fix Optional type hints in exceptions.py
- AIA-011: export ReferenceGuideNotFoundError from __init__.py
Additional:
- Add --llm-model CLI flag to rocpd analyze (passes model to LLMAnalyzer
via ROCPD_LLM_MODEL env var with proper save/restore)
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
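For AIA-007 above, the PR text only says path redaction is regex-based; the pattern and helper below are an assumed sketch of how `_sanitize_data()` might scrub filesystem paths before anything is sent to an LLM:

```python
import re

# Assumed sketch of file-path redaction (AIA-007): replace absolute
# POSIX-style paths with a placeholder so source locations never leave
# the machine. The real regex in _sanitize_data() is not shown in the PR.
PATH_RE = re.compile(r"(/[\w.\-]+)+")

def redact_paths(text):
    return PATH_RE.sub("<path>", text)

print(redact_paths("kernel at /home/user/src/gemm.hip:42"))
# → kernel at <path>:42
```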
sanitize_input_list() iterates over its argument, so passing a plain str causes it to iterate over individual characters (e.g. 'p', 'r', 'o', ...). Wrap the single path string in a list in both analyze_database() and validate_database() so the path is treated as one item.

Fixes: analyze_database() returning 0 kernels when called via the Python API even though the CLI works correctly.
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
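The str-vs-list pitfall is easy to reproduce. The body of `sanitize_input_list()` here is a hypothetical stand-in; only the iteration behavior it demonstrates is from the commit message:

```python
# Minimal reproduction: iterating a str yields its characters, so a bare
# path string is silently exploded into single-character "items".
def sanitize_input_list(items):
    return [str(i).strip() for i in items]

print(sanitize_input_list("prof.db"))    # each character becomes an item
print(sanitize_input_list(["prof.db"]))  # the fix: wrap the path in a list
```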
- Add if __name__ == '__main__' entry point to test_ai_analysis_standalone.py so it can be invoked directly by Python (required for CTest integration)
- Add configure_file() to copy test file to build directory at cmake time
- Add rocprofiler_add_integration_execute_test() registering rocprofv3-test-rocpd-ai-analysis-unit-tests (test #597) with labels integration-tests;rocpd;pytest and 120s timeout
- 23 tests pass via: ctest -R rocprofv3-test-rocpd-ai-analysis-unit-tests -V
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…fy output format

- llm_analyzer.py: try max_completion_tokens first (required by gpt-5, o1, o3, and newer gpt-4o variants); fall back to legacy max_tokens transparently if the model reports max_completion_tokens as unsupported (old models)
- analyze.py: print a format hint when output defaults to text (.txt), so users know to add --format webview / --format json / --format markdown
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
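The try-first-then-fall-back pattern can be sketched as below. The `client` callable stands in for the real OpenAI SDK call, and `TypeError` stands in for the API's "unsupported parameter" error; both substitutions are assumptions for illustration:

```python
# Hedged sketch of the token-parameter fallback: attempt the newer
# max_completion_tokens keyword first, retry with legacy max_tokens
# if the model (here simulated by a plain function) rejects it.
def create_completion(client, **kwargs):
    try:
        return client(max_completion_tokens=1024, **kwargs)
    except TypeError:  # stand-in for the API's unsupported-parameter error
        return client(max_tokens=1024, **kwargs)

def old_model_client(max_tokens=None):
    # Simulates an older model that only understands max_tokens.
    return {"used": "max_tokens", "value": max_tokens}

print(create_completion(old_model_client))
# → {'used': 'max_tokens', 'value': 1024}
```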
…ters

The recommendation engine was suggesting commands like:
rocprofv3 --pmc GRBM_COUNT GRBM_GUI_ACTIVE SQ_WAVES ...
even when those exact counters were already present in pmc_events.

Root cause: _detect_already_collected() tracked trace flags (--sys-trace, --kernel-trace, etc.) but never inspected pmc_events for counter names. _filter_rec_commands() only checked command flags, not --pmc arg values.

Fixes:
- _detect_already_collected(): query pmc_events for DISTINCT counter_name; add "pmc:<NAME>" entries to the covered frozenset for each counter found
- _filter_rec_commands(): for rocprofv3 commands, strip already-collected counters from the --pmc arg value; drop --pmc entirely if all counters are covered; treat --kernel-names as a scope filter (not data collection) so a command reduced to only scope+output args is dropped cleanly; append note listing removed counters to recommendation description
- Add 7 unit tests covering full/partial/zero PMC stripping, full_command update, description note, kernel-names-only drop, and rocprof-compute always-kept behavior
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
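The "pmc:<NAME>" bookkeeping above suggests a filter along these lines. The helper name and return shape here are illustrative assumptions; only the prefix convention and the strip/drop behavior come from the commit message:

```python
# Sketch of --pmc deduplication: remove counters already present in
# pmc_events (recorded as "pmc:<NAME>" in the covered set) from a
# recommended --pmc value; an empty remainder means drop --pmc entirely.
def strip_collected_pmc(pmc_counters, covered):
    covered_names = {c[len("pmc:"):] for c in covered if c.startswith("pmc:")}
    remaining = [c for c in pmc_counters if c not in covered_names]
    removed = [c for c in pmc_counters if c in covered_names]
    return remaining, removed

covered = frozenset({"pmc:GRBM_COUNT", "pmc:SQ_WAVES"})
remaining, removed = strip_collected_pmc(
    ["GRBM_COUNT", "SQ_WAVES", "FETCH_SIZE"], covered
)
print(remaining)  # → ['FETCH_SIZE']  (the only counter not yet collected)
```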
… hint

SCHEMA_CHANGELOG.md — add v0.1.8 entry covering:
- PMC counter deduplication: _detect_already_collected() now inspects pmc_events; _filter_rec_commands() strips already-collected counters from --pmc args and drops fully-redundant commands
- OpenAI max_completion_tokens compatibility for gpt-5/o1/o3
- Output format hint when text is the default
- CTest registration of 23 AI analysis API unit tests

AI_ANALYSIS_API.md:
- Add "Recommendation Deduplication" section explaining the PMC and trace-flag deduplication table and behavior
- Note OpenAI model compatibility (max_completion_tokens auto-fallback)

CLAUDE.md:
- Bump schema version reference: v0.1.1 → v0.1.8
- Update test count: 69 → 76 (7 new PMC filter tests)
- Add PMC deduplication and OpenAI compat notes to Python API section
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…nter knowledge

Incorporate knowledge from four AMD ROCm profiling blog articles to improve LLM-guided analysis quality and progressive recommendation accuracy.

Key additions:
- Recommended AMD 3-step profiling workflow: rocprof-sys (system timeline) → rocprofv3 (hardware counters on hot kernels) → rocprof-compute (deep analysis); guide LLM to recommend only the incremental next step
- Amdahl's Law as the core prioritization principle (focus on kernels >10% of total time only)
- VGPR→Occupancy table for all CDNA architectures (32/64/96/128/168/256 VGPRs mapped to occupancy %)
- Hardware Counter Reference table with 10+ counters and derived metric formulas (GPU utilization, BW, L2 hit rate, VALU util, LDS util)
- Bandwidth formula: (FETCH_SIZE + WRITE_SIZE) * 64 bytes / duration_ns
- Memory Hierarchy section: VGPR→LDS→L1→L2→HBM with per-GPU cache sizes and hit-rate thresholds that indicate problems
- LDS bank conflicts: 32 banks, detection and avoidance patterns
- API/Launch Overhead as a new explicit bottleneck type
- ILP and HIP Streams as new optimization techniques
- Multi-GPU/MPI profiling guidance in the rocprof-sys section
- Ridge points per GPU: MI300X ~31, MI250X ~15, MI100 ~19 FLOP/Byte
- Confidence level examples with concrete counter-based phrasing
- Expanded GPU specs: SIMDs per CU (4), max waves per SIMD (8), L1 sizes
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
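The bandwidth formula quoted above is simple enough to write as a one-line helper. The counters are reported in 64-byte granules and the duration in nanoseconds, so bytes/ns conveniently equals GB/s:

```python
# (FETCH_SIZE + WRITE_SIZE) * 64 bytes / duration_ns, as quoted above.
# bytes per nanosecond is numerically identical to gigabytes per second.
def hbm_bandwidth_gbps(fetch_size, write_size, duration_ns):
    bytes_moved = (fetch_size + write_size) * 64
    return bytes_moved / duration_ns

print(hbm_bandwidth_gbps(1_000_000, 500_000, 100_000))  # → 960.0
```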
Implements interactive.py with SessionData, PersistentMenuItem, HistoryEntry dataclasses and SessionStore (save/load/find_by_source_dir) for --interactive session file I/O under ~/.rocpd/sessions/. All 5 unit tests pass. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Wrap load() body in try/except; on failure emit warnings.warn and return None
- Replace lambda sort key in find_by_source_dir with _safe_dt() using datetime.fromisoformat + fallback to datetime.min
- Remove redundant 'import dataclasses' inside to_dict() (already at module level)
- Widen SessionStore.__init__ type hint to Union[str, pathlib.Path]; add Union to imports
- Add 5 new tests: malformed JSON skipped, make_session_id slug/spaces/fallback, newest-first ordering
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
… resume prompt

Implements Task 2 of the interactive session feature:
- Add rendering helpers (_print, _input, _PRI_STYLE) with optional rich console support
- Add InteractiveSession class with main event loop, session init/resume logic, and save-on-quit
- Add _prompt_resume() for auto-detecting and offering to resume prior sessions
- Add _render_main_menu() showing persistent menu items from previous analyses
- Add stubs for _path_profiling(), _path_optimize(), _pursue_recommendation()
- Add TestInteractiveSessionMenu with 3 tests (new session, quit-saves, resume-loads)
- All 13 tests pass (10 existing TestSessionStore + 3 new TestInteractiveSessionMenu)
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…remove duplicate import

- Wrap `_input()` in `run()` with try/except EOFError to call save-and-quit gracefully
- Print feedback message in `_prompt_resume()` when selection is out of range or unrecognized
- Remove duplicate `from rich.panel import Panel` inside `_render_main_menu()` (module-level import already covers it)
- Add 4 new tests: [s] save without quit, EOF exits cleanly, numeric entry pursues recommendation, invalid resume choice starts new session (17 tests total, all passing)
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…k 3) Replace the _path_profiling stub with a full implementation that displays profiling commands from tier0 and existing recommendations, optionally annotates them via LLM (metadata only, no source text), prompts for a .db file path, runs Tier 1/2 analysis, and promotes resulting recommendations to the persistent menu. Add _collect_profiling_commands, _llm_annotate_profiling_plan, and _run_tier1_analysis helpers. Add TestPathProfiling with 2 tests; all 19 tests pass. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Add _update_checkpoint_with_run() to WorkflowSession that finds the most recent CheckpointRecord without a run attached, sets its run_index to the latest trace_history index, and computes performance_delta_pct from total_runtime_ns when two or more analysis snapshots are available. Hook the method into _phase3_run_profiler after both successful trace-run save sites (trace-files-found path and manual-DB-entry path). Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…parate methods _update_checkpoint_with_run() was computing performance_delta_pct from analysis_history before Phase 4 had appended the current run's analysis, causing delta to always read stale data. Refactor: Phase 3 only sets run_index via the existing method; new _update_checkpoint_delta() is called from Phase 4 after _record_analysis() so analysis_history[-1] is always the current run. Add test_update_checkpoint_delta_noop_when_insufficient_history. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
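The delta computation being split out above reduces to comparing the last two analysis snapshots. The field names follow the commit messages (total_runtime_ns, analysis_history), but the dict-based shape is an assumption:

```python
# Sketch of _update_checkpoint_delta's core math: compare the newest
# analysis snapshot with the previous one; with fewer than two
# snapshots there is nothing to compare (the no-op case tested above).
def performance_delta_pct(analysis_history):
    if len(analysis_history) < 2:
        return None
    prev = analysis_history[-2]["total_runtime_ns"]
    curr = analysis_history[-1]["total_runtime_ns"]
    return (curr - prev) / prev * 100.0

history = [{"total_runtime_ns": 2_000_000}, {"total_runtime_ns": 1_500_000}]
print(performance_delta_pct(history))  # → -25.0 (negative = faster)
```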
Add _rollback_to_checkpoint, _blacklist_checkpoint, and _build_blacklist_block to WorkflowSession, plus _restore_from_snapshots helper. Rollback uses git fast path when commit is reachable, falls back to file_snapshots otherwise. 9 new tests added; all 45 workflow tests pass. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…ves session

- Remove early `return` in `_rollback_to_checkpoint` when target_cp_id==-1 and git is unavailable: execution now falls through to the cleanup section so checkpoints, trace_history, analysis_history, and iteration_count are always cleared even when file restore is impossible.
- Add `self._save_session()` at the end of `_blacklist_checkpoint` so the blacklisted flag is persisted to disk immediately after mutation.
- Add test `test_rollback_baseline_no_git_still_clears_state` to verify the baseline-no-git path clears all state (46 tests total, all passing).
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Add _show_checkpoint_picker() to WorkflowSession displaying a checkpoint table with performance deltas and prompting for optional blacklisting of regression checkpoints before restoring. Wire [b] into _phase5_rec_menu across all three menu paths (already_reprofiled, all_info, HIGH/MEDIUM). Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Validate cp_id against actual cp_id set (not list length) to handle non-contiguous ids
- Show blacklist prompt for baseline rollback (not just partial rollbacks)
- Replace raw input() calls with _input() wrapper for EOFError safety
- Strengthen test assertion to verify _blacklist_checkpoint called with correct cp_id
When _build_blacklist_block() returns a non-empty string, prepend it to the suggestions passed to _llm_rewrite_file so the LLM avoids previously failed approaches when rewriting source files. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Add _teardown_checkpoints() to remove all checkpoint worktrees when a WorkflowSession exits (refs are preserved for GC protection). Add _prune_stale_worktrees() to clean up orphaned worktrees from crashed sessions at startup. Both are hooked into run(): pruning after _init_checkpoints(), teardown in the finally block. Current session worktrees are never pruned. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…e path Wrap remove_worktree calls in _teardown_checkpoints with try/except so any exception (e.g. FileNotFoundError when git is missing) cannot propagate out of the finally block and suppress _save_session. Also add an early-return guard in GitCheckpointManager.remove_worktree for empty worktree_path strings, preventing a spurious git error. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- CheckpointRecord dataclass with file_snapshots for offline restore
- GitCheckpointManager: git commit + update-ref + worktree add --detach per edit
- WorkflowState: repo_root, baseline_commit, checkpoints, active_checkpoint
- Phase 6: creates checkpoint after each AI edit batch
- Phase 3: records run_index and performance_delta_pct per checkpoint
- Phase 5: [b] rollback menu with checkpoint picker and blacklist prompt
- Blacklist: uses edit_summary directly; deduplicates; injects into Phase 6 LLM prompt
- Session exit: removes worktrees (refs stay for GC protection)
- Session start: dirty-tree abort; stale worktree pruning
- Fix: remove spurious _conv attribute from WorkflowSession (test_workflow_session_has_no_conv)
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Fix blacklist lost after rollback: persist blacklisted_approaches on WorkflowState
- Fix suggestions accumulating blacklist prefix on each retry: use effective_suggestions
- Fix cp_id lookup: use search-by-id instead of list index in rollback and blacklist
- Fix _gcm left set after dirty-tree abort in _init_checkpoints
- Fix pathlib.Path.exists mock in test to use return_value=False
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…ck formatting

- Remove is_dirty() from GitCheckpointManager — dirty working tree is not an obstacle because commit_files uses git add -- <specific_file> which only stages the exact files modified by each AI edit, leaving other in-progress changes untouched
- Remove the dirty-tree guard from _init_checkpoints() so sessions continue normally even when the repo has uncommitted changes
- Fix flake8 F841 in remove_worktree: drop unused result = assignment
- Apply black formatting to interactive.py and test_workflow.py
- Update tests: replace test_session_start_aborts_when_dirty with test_checkpoints_work_with_dirty_tree confirming checkpoints initialise successfully despite a dirty tree; remove two now-deleted is_dirty tests
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Pull request overview
This PR adds a new rocpd analyze module to generate offline, human-readable GPU trace insights (with optional LLM enhancement), plus supporting packaging/build integration and a substantial unit/integration test suite.
Changes:
- Adds AI analysis Python package (rocpd.ai_analysis), including persistent LLM conversation support and TraceLens-derived analysis utilities.
- Integrates the new analyze subcommand into the rocpd CLI and CMake test/packaging flows.
- Introduces extensive standalone/unit/integration tests for schema conformance, guide filtering, interactive workflow/checkpoints, and TraceLens port logic.
Reviewed changes
Copilot reviewed 29 out of 36 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| projects/rocprofiler-sdk/tests/rocprofv3/rocpd/test_guide_filter_standalone.py | Adds standalone tests for guide section tag selection and filtering logic. |
| projects/rocprofiler-sdk/tests/rocprofv3/rocpd/test_analyze_schema.py | Adds schema structure + conformance tests (incl. Tier 0 and combined outputs) with Py3.6 shim. |
| projects/rocprofiler-sdk/tests/rocprofv3/rocpd/test_ai_analysis_standalone.py | Adds standalone API + regression tests for AI analysis behaviors and security/correctness fixes. |
| projects/rocprofiler-sdk/tests/rocprofv3/rocpd/CMakeLists.txt | Wires rocpd analyze into integration tests and runs standalone pytest-based test scripts. |
| projects/rocprofiler-sdk/tests/pytest-packages/pytest_utils/perfetto_reader.py | Minor SQL formatting cleanup in trace reader query. |
| projects/rocprofiler-sdk/source/scripts/format-deps.py | Removes unused import and reformats argparse definition. |
| projects/rocprofiler-sdk/source/lib/python/utilities.cmake | Installs analyze.py, tracelens_port.py, and copies ai_analysis runtime assets (excluding tests). |
| projects/rocprofiler-sdk/source/lib/python/rocpd/tracelens_port.py | Adds TraceLens-derived interval/categorization/short-kernel analysis utilities. |
| projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/tests/test_workflow.py | Adds mock-based tests for workflow session phases and git checkpoint manager behavior. |
| projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/tests/test_tracelens_port.py | Adds unit + optional integration tests for tracelens_port functions. |
| projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/tests/test_local_llm.py | Adds tests for local OpenAI-compatible endpoint provider behavior. |
| projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/tests/test_llm_conversation.py | Adds tests for streaming, compaction, persistence, and interactive integration for LLMConversation. |
| projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/tests/test_interactive.py | Adds tests for session storage/menu behavior and profiling/optimize flows. |
| projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/tests/test_api_standalone.py | Adds standalone tests for public API, exceptions, serialization, and recommendation bucketing. |
| projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/tests/__init__.py | Marks ai_analysis tests package. |
| projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/share/amd_rocm_logo.png | Adds branding asset used by interactive UI. |
| projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/llm_conversation.py | Introduces persistent multi-turn LLM session with streaming + compaction + archive. |
| projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/exceptions.py | Adds typed exception hierarchy for AI analysis module. |
| projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/docs/LLM_GUIDE_SECTIONS.md | Documents context-tagged guide section filtering system. |
| projects/rocprofiler-sdk/source/lib/python/rocpd/ai_analysis/__init__.py | Exposes public AI analysis API surface + lazy interactive imports. |
| projects/rocprofiler-sdk/source/lib/python/rocpd/__main__.py | Adds rocpd analyze CLI subcommand and argument validation. |
| projects/rocprofiler-sdk/source/bin/rocprofv3.py | Formatting-only change to env var update call. |
| projects/rocprofiler-sdk/cmake/Modules/rocprofiler-sdk-utilities.cmake | Avoids list(GET ...) errors when no GPUs are detected at configure time. |
| .gitignore | Ignores Claude session data, Python bytecode, and generated analysis output directory. |
…compat in CMake schema test

- llm_conversation.py: after parsing ROCPD_LLM_PRIVATE_HEADERS, validate the result is a dict and raise a clear ValueError if it is not (e.g. if the env var was set to a JSON array or string instead of an object)
- tests/rocprofv3/rocpd/CMakeLists.txt: replace importlib.resources.files() with pkgutil.get_data() in the inline schema-validate test so it works on Python 3.6 where importlib.resources.files() is not available; also replace f-strings with str concatenation for broad Python compatibility
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
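The env-var validation above can be sketched as below. The helper name and the empty-dict default for an unset variable are assumptions; the dict check and the ValueError come from the commit message:

```python
import json

# Hedged sketch: ROCPD_LLM_PRIVATE_HEADERS must decode to a JSON object
# (a dict after json.loads); arrays and bare strings are rejected with a
# clear error instead of failing later with a confusing one.
def parse_private_headers(env):
    raw = env.get("ROCPD_LLM_PRIVATE_HEADERS")
    if raw is None:
        return {}
    headers = json.loads(raw)
    if not isinstance(headers, dict):
        raise ValueError(
            "ROCPD_LLM_PRIVATE_HEADERS must be a JSON object, got "
            + type(headers).__name__
        )
    return headers

print(parse_private_headers({"ROCPD_LLM_PRIVATE_HEADERS": '{"X-Auth": "token"}'}))
# → {'X-Auth': 'token'}
```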
…own tag

- test_ai_analysis_standalone.py: test_kernel_name_shell_quoted_in_full_command was filtering for rocprofv3 commands but the kernel name only appears in the rocprof-compute command (rocprofv3 collects general PMC counters without kernel-name scoping). Switch filter to rocprof-compute where shlex.quote() is correctly applied.
- test_guide_filter_standalone.py: add tracelens_metrics to KNOWN_TAGS — this tag is used in llm_analyzer.py (_select_tags adds it when TraceLens data is present) and tagged in llm-reference-guide.md, but was missing from the vocabulary guard set causing test_all_tags_are_from_known_vocabulary to fail.
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…or test scripts

configure_file(COPYONLY) runs only at cmake configure time, leaving stale copies in the build directory when test files are edited during development.

Introduce rocpd_stage_test_script() helper function that uses:
add_custom_command(OUTPUT ... DEPENDS <src>) + add_custom_target(ALL ...)

This means cmake --build re-copies any test file whose source has changed, without requiring the developer to re-run cmake configure. Also adds set_property(CMAKE_CONFIGURE_DEPENDS) so cmake does re-configure automatically when a CI system or fresh checkout triggers it.

Replace all configure_file COPYONLY calls for Python test scripts (both the tests/rocprofv3/rocpd/ originals and the ai_analysis/tests/ sub-package copies) with rocpd_stage_test_script().
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Motivation
AMD ROCm users need actionable guidance from their GPU profiling data, but interpreting raw
`rocprofv3` output requires deep GPU architecture knowledge. This PR introduces an AI-powered analysis module (`rocpd analyze`) that reads `.rpdtrace` databases and produces human-readable performance insights, bottleneck detection, and optimization recommendations, all without an internet connection or LLM API key.

The module follows a tiered progressive analysis strategy (Tier 0–4): users get immediately useful output from any profiling run, and deeper analysis becomes available as more data is collected.
Technical Details
New subcommand:
```
rocpd analyze [-i trace.db] [--source-dir ./src] [--interactive "<app>"]
```

Core analysis (`analyze.py`, `ai_analysis/`)
- Source-only mode (`--source-dir`): scans `.hip`/`.cpp`/`.cu`/`.py` files for GPU programming patterns (kernels, memcpy, sync, ROCTx, frameworks) and produces a profiling plan with a suggested first `rocprofv3` command and recommended PMC counters. Works without a `.db` file.
- Deeper, PMC-based analysis when a `pmc_events` table is present.
- Output formats: `text`, `json` (schema v0.1.x/v0.2.0), `markdown`, `webview` (self-contained AMD-themed HTML with SVG gauges, collapsible recommendation cards, sortable hotspot table, hover tooltips, light/dark toggle).
- LLM integration (`--llm anthropic|openai|private`): optional natural-language explanations via Anthropic Claude, OpenAI, or any OpenAI-compatible private/enterprise endpoint. Kernel names and paths are sanitized before transmission. Falls back gracefully when unavailable.
- Focused analysis (`--prompt`): target the analysis at a specific question (e.g. `--prompt "Why is my matmul kernel slow?"`).
- `_split_pmc_into_passes()` automatically separates TCC-derived counters (`FETCH_SIZE`, `WRITE_SIZE`) into dedicated passes to avoid hardware block limit errors (rocprofv3 error code 38).
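The pass-splitting rule can be sketched as follows. `TCC_DERIVED` and `split_pmc_into_passes` are illustrative names (the real helper is `_split_pmc_into_passes()`), and grouping all TCC-derived counters into a single dedicated pass is an assumption:

```python
# Illustrative sketch of the idea behind _split_pmc_into_passes(): TCC-derived
# counters (FETCH_SIZE, WRITE_SIZE) share a hardware block, so combining them
# with a full counter set can exceed the block limit (rocprofv3 error code 38).
TCC_DERIVED = {"FETCH_SIZE", "WRITE_SIZE"}

def split_pmc_into_passes(counters):
    tcc = [c for c in counters if c in TCC_DERIVED]
    rest = [c for c in counters if c not in TCC_DERIVED]
    passes = []
    if rest:
        passes.append(rest)
    if tcc:
        passes.append(tcc)  # dedicated pass for TCC-derived counters
    return passes
```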
Interactive workflow (`ai_analysis/interactive.py`)

Two session classes:
- `InteractiveSession`: menu-driven `[p]`/`[a]`/`[o]`/`[s]`/`[q]` loop launched after standard analysis:
  - a single `LLMConversation` is shared across all `[a]`/`[o]` calls in a session; history survives `--resume-session`
  - `LLMConversation` auto-compacts every N turns (configurable via `--llm-compact-every`) using an LLM-generated summary to stay within context limits
  - `rocprofv3` commands extracted from LLM responses are offered as a numbered run menu
  - the session is saved to `~/.rocpd/sessions/` on `[s]`, `[q]`, and Ctrl+C
- `WorkflowSession`: 7-phase automated profiling + optimization loop triggered by `--interactive "<app>"`:
  - multi-process runs pass the `rocprofv3` flags `--process-sync` and `-o results_%nid%` so each process writes its own DB; after profiling, per-process databases are merged via `rocpd.merge.merge_sqlite_dbs()` before analysis
  - a `.bak` backup is created before any edit
  - `[v]`/`r` reverts the last edit, prompts for error context, calls the LLM to analyze the failure and propose an alternative, then shows a what-next menu (`[f]` retry fix / `[p]` re-profile / `[q]` exit)
  - `[r]` re-profile → same INFO → `[r]` loops are broken by fingerprinting collected counters and flags across all prior runs
  - a `[b]` rollback menu in Phase 5 lets users revert to any prior state
  - state is persisted to `~/.rocpd/sessions/workflow_<ts>_<slug>.json`

WorkflowSession — Session Checkpoints
Each AI source-file edit creates a git-worktree checkpoint so the user can roll back to any prior state and blacklist approaches that caused regressions.
- `[b]` rollback menu in Phase 5: shows a checkpoint table with performance deltas; regression checkpoints are flagged; the user is prompted to blacklist before rollback
- blacklisted approaches are recorded in `WorkflowState.blacklisted_approaches` (never truncated by rollback)
- rollback uses `checkout <hash> -- <file>` (fast path) or a file-snapshot write (fallback when git is unavailable)
- `commit_files` stages only the AI-modified files (`git add -- <file>`), so in-progress user changes are never touched or included in checkpoint commits
- `_init_checkpoints` + `_prune_stale_worktrees` at start; `_teardown_checkpoints` in `finally` (removes worktrees; refs kept for GC protection)

LLM conversation (`ai_analysis/llm_conversation.py`)

New `LLMConversation` class replacing the previous `SessionContext` dict approach:
- streamed response chunks are accumulated with `list.append` + `"".join()` (O(n)) instead of string concatenation (O(n²)), avoiding quadratic allocation on long responses
- compaction keeps the `keep_recent_turns` most recent turns verbatim and summarizes older turns with a non-streaming LLM call
- history is appended to `~/.rocpd/sessions/<id>_history.jsonl`
- `to_dict()`/`from_dict()` for full session persistence and resume
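The O(n) streaming accumulation amounts to this pattern (a minimal sketch; the real `LLMConversation` carries far more state):

```python
class StreamAccumulator:
    """Accumulate streamed LLM response chunks in O(n) total time.

    Appending to a list and joining once avoids the O(n^2) behavior of
    repeated `buf += chunk` string concatenation on long responses.
    """

    def __init__(self):
        self._chunks = []

    def feed(self, chunk):
        self._chunks.append(chunk)    # O(1) amortized per chunk

    def text(self):
        return "".join(self._chunks)  # single O(n) join at the end
```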
LLM hardening
- `ROCPD_LLM_PRIVATE_HEADERS` dict validation: after `json.loads()` the result is validated to be a `dict`; a non-dict JSON value (e.g. an array) raises a `ValueError` with a clear message showing the expected format, rather than an opaque `TypeError` from `headers.update()`
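A minimal sketch of that validation, with `load_private_headers` as a hypothetical stand-in for the module's actual parsing code:

```python
import json
import os

def load_private_headers(env="ROCPD_LLM_PRIVATE_HEADERS"):
    """Hypothetical stand-in for the module's env-var parsing.

    The value must be a JSON *object*: its entries are merged into request
    headers via headers.update(), so a JSON array or string would otherwise
    surface as an opaque TypeError deep inside the HTTP layer.
    """
    raw = os.environ.get(env)
    if not raw:
        return {}
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError(
            '{} must be a JSON object, e.g. {{"X-Api-Key": "..."}}; '
            "got {}".format(env, type(parsed).__name__)
        )
    return parsed
```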
Build & packaging (`utilities.cmake`)
- `file(COPY ... DESTINATION ...)` replaces `configure_file ... COPYONLY` for AI analysis assets, fixing EPERM on binary files (e.g. PNG) during CMake configure
- `*.png` added to the `rocpd_AI_ANALYSIS_FILES` glob so `ai_analysis/share/amd_rocm_logo.png` (used by the interactive session banner) is installed alongside `.py`/`.md`/`.json` files
- `tracelens_port.py` added to `rocpd_PYTHON_SOURCES`
- guarded the `list(GET ...)` calls in `rocprofiler-sdk-utilities.cmake` with an early return when `rocminfo` returns an empty GPU list. Note: a GPU is required to run the integration tests; builds on GPU-less machines configure cleanly, but the test suite is not expected to pass without hardware.

Python 3.6 compatibility (RHEL 8.8 / SLES 15.6)
- `tracelens_port.py`: changed the `_CATEGORY_PATTERNS: List[Tuple[str, re.Pattern]]` annotation to `List[Tuple[str, Any]]`. `re.Pattern` was introduced in Python 3.7; Python 3.6 evaluates module-level annotations eagerly, causing an `AttributeError` at import time that cascaded into all tests importing `analyze.py` or `llm_analyzer.py`.
- `test_analyze_schema.py`: added a `try/except ImportError` shim for `importlib.resources` (Python 3.7+), falling back to `pkgutil.get_data()` on Python 3.6.
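The shim is roughly this shape. Package and resource names are illustrative (the real code lives in `test_analyze_schema.py`); note that `importlib.resources` arrived in 3.7 but its `files()` API only in 3.9, which the `AttributeError` branch covers:

```python
def read_packaged_schema(package="rocpd.ai_analysis",
                         name="analysis-output.schema.json"):
    # Illustrative fallback chain:
    #   3.9+  -> importlib.resources.files()
    #   3.7/8 -> AttributeError on .files(), fall back to pkgutil
    #   3.6   -> ImportError on importlib.resources, fall back to pkgutil
    try:
        from importlib import resources
        return resources.files(package).joinpath(name).read_bytes()
    except (ImportError, AttributeError):
        import pkgutil
        return pkgutil.get_data(package, name)
```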
Schema file corrections

The `analysis-output.schema.json` file was corrected to match the already-documented v0.2.0 specification. The emitted JSON format was never wrong; only the validator was:
- `profiling_mode` enum was missing `"source_only"`
- `analysis_tier` minimum was 1, now 0
- `execution_breakdown` type was `"object"` only, now `["object", "null"]`
- `tier0` property was undeclared
- `$id` embedded a version string, now the stable `"rocpd-ai-analysis-output"`

Tier 0 source-only JSON output (`schema_version: "0.2.0"`) now passes `jsonschema.validate()`.
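As a rough illustration of the corrected validator, here is a minimal schema fragment with those fixes applied. The surrounding property set and the enum's other values are assumptions, not the actual schema:

```python
import jsonschema  # third-party: pip install jsonschema

# Minimal illustrative fragment reflecting the corrections above.
schema = {
    "type": "object",
    "properties": {
        "schema_version": {"type": "string"},
        "profiling_mode": {"enum": ["full", "counters_only", "source_only"]},
        "analysis_tier": {"type": "integer", "minimum": 0},
        "execution_breakdown": {"type": ["object", "null"]},
        "tier0": {"type": "object"},
    },
}

# A Tier 0 source-only result now validates cleanly:
tier0_output = {
    "schema_version": "0.2.0",
    "profiling_mode": "source_only",
    "analysis_tier": 0,
    "execution_breakdown": None,
    "tier0": {},
}
jsonschema.validate(instance=tier0_output, schema=schema)
```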
Tests
- `tests/rocprofv3/rocpd/test_analyze.py`: 76 unit tests covering all recommendation rules, helper functions, the PMC filter, and output formatters
- `tests/rocprofv3/rocpd/test_analyze_schema.py`: 28 JSON schema conformance tests (v0.1.x, v0.2.0 source-only, and combined Tier 0 + Tier 1/2; was 17)
- `tests/rocprofv3/rocpd/test_ai_analysis_standalone.py`: 23 Python API unit tests (`analyze_database`, `analyze_source`, `AnalysisResult`)
- `tests/rocprofv3/rocpd/test_guide_filter_standalone.py`: LLM reference guide section filter tests
- `ai_analysis/tests/test_interactive.py`: 22 interactive session unit tests
- `ai_analysis/tests/test_llm_conversation.py`: `LLMConversation` streaming/compaction/persistence tests
- `ai_analysis/tests/test_workflow.py`: 52 `WorkflowSession` phase tests including full checkpoint system coverage (`CheckpointRecord`, `GitCheckpointManager`, rollback, blacklist, teardown, stale pruning)

JIRA ID
N/A
Test Plan
- `pytest --noconftest` from the build output directory
- `ctest -R rocpd-analyze` after a full build (requires AMD GPU)
- `merged_db.db` (2000 kernel dispatches + 64000 PMC samples) exercised for Tier 1/2 analysis and all four output formats

Test Result
- `test_analyze.py` unit tests pass
- `test_workflow.py` tests pass (checkpoint system coverage)
- `test_interactive.py` tests pass (no regressions)
- `test_llm_conversation.py` tests pass (no regressions)
- `jsonschema.validate()` passes for Tier 0, Tier 1/2, and combined JSON output

Submission Checklist